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An algorithm for discovering Lagrangians automatically from data 
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An activity fundamental to science is building mathematical models. These models are used to 
both predict the results of future experiments and gain insight into the structure of the system 
under study. We present an algorithm that automates the model building process in a scientifically 
principled way. The algorithm can take observed trajectories from a wide variety of mechanical 
systems and, without any other prior knowledge or tuning of parameters, predict the future evolution 
of the system. It does this by applying the principle of least action and searching for the simplest 
Lagrangian that describes the system’s behaviour. By generating this Lagrangian in a human 
interpretable form, it also provides insight into the working of the system. 
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I. INTRODUCTION 

Modern science is, in many senses, highly automated. 
Experiments are frequently run under computer control, 
with data often recorded by the computer directly. Com¬ 
puterised data analysis and visualisation are widely used 
to process the resulting large volumes of data. Indeed, 
the ability to collect and analyse massive data sets is 
opening up an entirely new measure-first-ask-questions- 
later approach to science: the Square Kilometer Array 
radio telescope is expected to collect approximately one 
exabyte of data per day [T]; over 10^"^ collisions from the 
ATLAS detector were analysed in the search for the Higgs 
boson P]; and state-of-the-art whole-genome sequencers 
can currently sequence 600 gigabases per day [3] . In each 
of these examples the scientihc questions are not fully for¬ 
mulated in advance of taking the data, and the question 
of how to best extract knowledge from the dataset is of 
great interest. This motivates the study of how to scale 
up the processes of scientific reasoning to take advantage 
of the wealth of available data. 

Thus far, scientihc reasoning has largely resisted au¬ 
tomation. Hypothesising and rehning models is still on 
the whole carried out by humans, with little direct sup¬ 
port from computers. It has long been a desire of artihcial 
intelligence researchers to automate this part of science, 
and with the growing volume of data available from ex¬ 
periments the motivation for this desire comes ever more 
sharply into focus. In this paper we present a step in this 
direction: an algorithm that automates hnding a mathe¬ 
matical model for a system, in a scientihcally principled 
way, by examining only its observed behaviour. 

Early attempts to automatically model physical sys¬ 
tems searched for simple mathematical regularities in ob¬ 
served quantities. Langley’s BACON system was able 
to re-discover many simple laws — the ideal gas law. 
Ohm’s law. Coulomb’s law and others — from experi¬ 
mental data [4j. Dzeroski and Todorovski went beyond 
simple static laws with their LAGRANGE system which 
was able to search for differential equations that governed 
observed time series [5]. They extended this work to the 
LAGRAMGE system which additionally allowed an ex¬ 
pert user to provide domain knowledge, improving the 


quality of the results [6]. The PRET system, developed 
by Bradley and collaborators, brings to bear a variety 
of advanced AI techniques on the problem of identifying 
system differential equations [3- It has a sophisticated 
method for representing qualitative observations, and al¬ 
lows expert-user domain knowledge to be combined with 
automatic search very effectively. Schmidt and Lipson [5] 
used a genetic programming approach to automatically 
evolve invariant mathematical expressions from observed 
data [5]. In the context of engineering, there is a signif¬ 
icant body of work on ‘system identihcation’, with tech¬ 
niques ranging from very general ad hoc fitting methods 
to fitting detailed physical models representing important 
classes of system [13 E]. 

In this work we take a different approach than those 
described above, the essence of which is that we embed a 
simple, general physical principle — the principle of least 
action — and very little else into our algorithm. While we 
are embedding the domain knowledge of a physicist in our 
algorithm, we are not embedding information about any 
particular physical system or class thereof. Rather we are 
capturing a deep understanding that has been distilled by 
physicists over the past 270 years, and packaging it into 
an algorithm that can be applied by non-experts. We 
find the algorithm to be surprisingly powerful, given its 
simplicity, but this power comes not from the ingenuity 
of its construction, rather from the broad applicability of 
the physical principle embedded in it. 


II. THE PRINCIPLE OF LEAST ACTION 

The principle of least action is one of the most funda¬ 
mental and most celebrated principles in physics. First 
proposed by Maupertuis [T2] and Euler [T3] it states that 
the problem of predicting the behaviour of many physical 
systems can be cast as finding the behaviour that min¬ 
imises the expenditure of some cost function. The total 
expenditure of the cost function by the system is known 
as the action. It is a remarkable fact that the behaviour of 
a very wide range of physical systems — including those 
studied in classical mechanics, special and general rela¬ 
tivity, quantum field theory, and optics — can all have 
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their behaviour explained in terms of minimising a cost 
function. 

Each physical system has its own cost function, and 
once this function is known it is possible to predict ex¬ 
actly what the system will do in the future. The exer¬ 
cise of determining the cost function — often known as 
Lagrange’s function, or just the Lagrangian — for a par¬ 
ticular physical system is central to physics. Feynman 
described this process well [T3] as “some kind of trial 
and error” advising students that “You just have to fid¬ 
dle around with the equations that you know and see if 
you can get them into the form of the principle of least 
action.” In this paper we present an algorithm that does 
this “fiddling” automatically, without requiring the user 
to have any expertise in physics. 

The Lagrangian is ideal as the output of an automated 
modelling algorithm. It has the desirable property that 
it is a single, scalar expression that contains everything 
necessary to predict the system’s future evolution. Con¬ 
sider, in contrast, finding the Hamiltonian where it would 
also be necessary to find the corresponding conjugate mo¬ 
menta. The Lagrangian has the additional quality that it 
is coordinate-independent and as a result can be written 
in any coordinate system. This is useful in the case of 
an automated algorithm where it is not obvious in which 
coordinate system the data might be presented. 

The problem of taking a Lagrangian and automatically 
calculating the resulting motion of the system has been 
widely studied and applied. To the best of our knowl¬ 
edge, this work is the first that solves the inverse prob¬ 
lem of taking the observed motion and calculating the 
Lagrangian for non-trivial systems. 


III. THE ALGORITHM 

To find a model for a system we search over a space of 
possible Lagrangians. To do this we need three elements: 
an objective to guide the search, which will take the form 
of a score function; a representation of the possible La¬ 
grangians; and an algorithm to execute the search over 
the possible Lagrangians, working to improve the score. 
We will first describe the score function, which is the 
central idea of the algorithm. 


A. Score function 

The objective of our algorithm is to find a Lagrangian 
that, when integrated along the system’s observed tra¬ 
jectory, yields a smaller total (action) than when inte¬ 
grated along any neighbouring trajectory. It would be 
possible to implement this definition of the least action 
principle directly in an algorithm, but instead we take an 
indirect approach that is more computationally efficient. 
For a Lagrangian C{9, (p,... ...) and a trajectory 

..., 0{t), p{t ),...) it is possible to write down 
a condition, in the form of a set of differential equations. 


that must be satisfied if the action is to be stationary 
along the trajectory. These differential equations are 
known as Euler-Lagrange’s equations, 

ddC_dC_ 

dt dq dq 

where q G {0, <(),...}. It is to be understood that the 
partial derivatives are taken symbolically with respect to 
the coordinates and velocities, which are then replaced 
with the time-dependent functions from the trajectory 
before the time-derivative is taken. We can define a score 
function based on these conditions. 


EL(/:) 


E 

qG{e,(!>,...} 


d dC 
dt dq 



( 1 ) 


which is zero if the Euler-Lagrange equations are exactly 
satisfied. We note that Hillar and Sommer first proposed 
using a (different) score function derived from the Euler- 
Lagrange equations in m, but did not apply it to finding 
Lagrangians from data. 

In practice our observations of the system are not func¬ 
tions {0{t),(j){t),... ,0{t),<p{t),...) but discretely sam¬ 
pled time-series of the coordinates and generalized ve¬ 
locities. The algorithm operates with a dataset which is 
a time-series of samples 

= (( 0 ( 1 ), ,^( 1 ),..., 0 ( 1 ), <^( 1 ),...),...) 


where time runs from t = 1... TV. We divide this time- 
series into two portions, a training set, comprising sam¬ 
ples 1... M, and a validation set of samples M +1... N. 
The algorithm will conduct its search using only the 
training set, reserving the validation set for out-of-sample 
measurement of the prediction error. In this way we can 
truly test the algorithm’s ability to predict the future dy¬ 
namics of the system. In all of the examples in this paper 
the sampling times will be evenly spaced, but this is not a 
requirement. We can discretize the Euler-Lagrange score 
function § to work with these sampled datasets, giving 


M 


ELd(£)=E E 

t=i qe{e,<j>,...} 


^dC' 
dt dq j 



( 2 ) 


where the subscript on the score indicates that it is taken 
with respect to the dataset D. The square-bracketed 
quantities in this expression are time-series, and the sub¬ 
script indicates taking the element in this time-series at 
the given time. So, for instance, the first term in is 
to be calculated, in principle, by: first differentiating the 
candidate Lagrangian C symbolically with respect to the 
appropriate generalized velocity; evaluating this quantity 
at every time-step in the dataset to yield a new time- 
series; taking the discrete derivative of this new time- 
series with respect to time; and finally finding the el¬ 
ement at time t in this time-derivative time-series. In 
practice, as we shall see below, a more computationally 
efficient implementation may be used. 
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The function ELd is the basis of the score function, 
capturing the principle of least action, but it is not suffi¬ 
cient on its own. While it is true that the Lagrangian we 
seek minimises ELd > tbe converse is not true as there are 
other functions which minimise ELd but are not physi¬ 
cally meaningful Lagrangians. The first class of functions 
that we wish to avoid are those which are numerically 
tiny, for instance £ = We deal with these by 

introducing a normalisation score for each candidate La¬ 
grangian, 


Nd(£) 


E E 




■ d dC 

2 

+ 

'dC 

_dt dq 

t 

dq 


dataset, but even worse on the control data. Overall, the 
score function drives the search to find Lagrangians that 
simultaneously minimise the action along the observed 
trajectory while having a non-zero action along the con¬ 
trol trajectory, and a normalisation score close to one. 

Note that the score function, S does not in any way 
consider whether the prediction of the candidate La¬ 
grangian agrees with the training data. It only considers 
whether the trajectory satisfies a least action principle 
for the candidate Lagrangian. The fact that this, on the 
face of it unrelated, objective leads to successful predic¬ 
tions is the insight from physics that we have embedded 
in our algorithm. 


We will compose our final score from the scores ELd 
and Nd in such a way, to be detailed below, that to 
score well a candidate Lagrangian must simultaneously 
have a low score for ELd and a score of around one for 
Nd. The target value of one for Nd is chosen arbitrarily. 
We can always arrange for the normalisation score to 
be approximately one, as the least-action trajectory is 
unchanged if the Lagrangian is scaled by a constant. 

There is a second, more interesting class, of unwanted 
expressions that minimise the Euler-Lagrange score ELd. 
Consider, for instance, the candidate Lagrangian C — 
9'^0. This Lagrangian satisfies the Euler-Lagrange equa¬ 
tions trivially, in a way that does not depend on the tra¬ 
jectory. Such path-independent least-action Lagrangians 
are interesting from a physics point-of-view, being closely 
related to gauge invariance, but here they are a nuisance. 
To guide the search away from these expressions we intro¬ 
duce a second ‘control’ trajectory, C. This trajectory is 
unrelated to the behaviour of the system under study and 
serves solely to eliminate path-independent Lagrangians. 
We reason that the Lagrangian that we are seeking will 
score well with ELd but should score poorly on ELc, 
which is the Euler-Lagrange score evaluated along the 
control trajectory. The exact form of the control trajec¬ 
tory is unimportant so long as it not a valid trajectory 
of the system under study. In this work we use a control 
trajectory which is uniform motion in each coordinate, 
with velocity arbitrarily chosen to be 0.1, for all experi¬ 
ments. 

We combine the three parts described above to give 
the search score function, 

S{C) = U (Nd(/:)) U (ELc(£)) , (3) 

where U{x) = ln(a; -|- e)^ -I- 1 is a function that is min¬ 
imised, with value approximately one, when the argu¬ 
ment is one. The small constant e, typically set to 
be 10“^°, ensures that the score function has the de¬ 
sired asymptotic behaviour for small values of the nu¬ 
merator and denominator, even when faced with er¬ 
rors from finite precision machine numbers. The fac¬ 
tor U (ELc(£)) prevents the search algorithm from driv¬ 
ing towards Largrangians that perform badly on the real 


B. Representation and search 

We have experimented with two representations of can¬ 
didate Lagrangians. The first, a restricted polynomial 
representation, allows a fast search algorithm to be im¬ 
plemented. It is limited in the Lagrangians it can rep¬ 
resent exactly, although through Taylor’s theorem it can 
find approximations to any Lagrangian. This representa¬ 
tion was used to generate the bulk of the results in this 
paper, and we describe it in detail in this section. The 
second representation lifts some of the constraints of the 
restricted polynomial model, at the expense of vastly in¬ 
creased computational cost. We describe it in section 

m 

The restricted polynomial representation assumes that 
the Lagrangian can be represented by a polynomial in 
the coordinates and velocities. The model is a sum of 
monomial terms, parametrised by coefficients multiply¬ 
ing every term. We restrict this polynomial in two ways: 
we limit the maximum power of any coordinate or veloc¬ 
ity to be m; and we limit the maximum degree of any 
combination of coordinates and velocities to p. In addi¬ 
tion, we remove any terms from the model that can have 
no physical significance, that is terms that are constant 
or of the form q^q. So, for instance, for one variable 9 
with TO = 3 and p = 4 the resulting model would be 

ci9^ + cjfi -I- c^9 -I- -I- c^99^ -I- 060^ -I- C’j9’^(P' -I- c%9^ . 

For a given restricted polynomial model the score func¬ 
tion is minimised by adjusting the parameters c^. We 
conduct this optimisation using the Nelder-Mead simplex 
algorithm m using the modified parameters of Gao and 
Han m which improve the efficiency in high dimensions. 
The coefficients are bounded between -1 and 1, enforced 
by a penalty function. We use a tight convergence tol¬ 
erance, usually one part in 10^°, to encourage the search 
to break out of local minima. We impose a maximum 
iteration limit, usually 5 x 10®, on the search to ensure 
that it is bounded in time. 

We do not know in advance what values of to and 
p are needed to accurately represent the Lagrangian of 
the system under study. What’s more, we wish to find 
the simplest Lagrangian such that the trajectory satisfies 
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FIG. 1. Sketches of the five test systems that we consider. 


the principle of least action. We approach this using a 
simple heuristic algorithm. We start with the smallest 
non-trivial model (m = 2,p = 2) and optimise the pa¬ 
rameters with the simplex search. We then make an in- 
sample prediction of how well the optimised Lagrangian 
predicts the dynamics of the system in the training sam¬ 
ple. This is done by generating equations of motion 
from the optimised Lagrangian and numerically solving 
them, using initial conditions derived from the first sam¬ 
ple in the dataset. If this in-sample prediction fits bet¬ 
ter than a specified tolerance then we stop and return 
the Lagrangian. If it does not fit then we generate a 
larger model (i.e. with larger values of m and/or p) and 
try again. The models are stepped through in increas¬ 
ing number of monomial terms. This proceeds until ei¬ 
ther a model is found that fits or a maximum bound on 
model complexity is reached. This heuristic algorithm 
only crudely captures the notion of mathematical com¬ 
plexity of the model, but it seems to work adequately 
well. 

Note that, for a given polynomial model, it is possible 
to partially pre-calculate the score function ELd for a 
given dataset, yielding a function quadratic in the coeffi¬ 
cients Ci- This is possible because the form of the model 
is fixed and it is possible to calculate its derivatives in ad¬ 
vance. As a result, after the initial simplification of the 
score function, optimisation iterations are fast, and have 
a run-time independent of the number of data points. 

Code for the score function, search algorithms and the 
datasets we use below can be downloaded from [T8|. 


IV. RESULTS 

We will consider five test systems, illustrated in figure 
[2 The first is the unforced Duffing oscillator, a textbook 
non-linear system. The second, a simple pendulum, is 
interesting because its Lagrangian cannot be represented 
exactly in the restricted polynomial representation. The 
third system, two masses on a frictionless surface joined 
by three springs to each other and two immovable walls, 
has two coupled degrees of freedom. The fourth system is 
the double pendulum, a coupled, two degree-of-freedom 
non-linear system capable of chaotic motion. As with the 


simple pendulum, the double pendulum cannot have its 
Lagrangian represented exactly by a finite degree poly¬ 
nomial. The fifth and final system is the Penning-type 
ion-trap, a three degree-of-freedom system with magnetic 
and electrostatic forces, that is of considerable experi¬ 
mental relevance. 

Figure shows the result of applying the algorithm to 
simulated data sets for these systems. It can be seen that 
the algorithm is able to successfully predict the future 
dynamics of all of the test systems. Let us look in detail 
at the progress of the algorithm, and the resulting learned 
models, for two of the example systems. 

In the case of the Duffing oscillator the algorithm tried 
seven, increasingly complex, polynomial models to ar¬ 
rive at the prediction shown, which was generated by the 
model with m = 4 and p = 4. The final model has 11 
free parameters, and required 2160 Nelder-Mead itera¬ 
tions to optimise. The complete search working through 
all seven models, with a single-threaded implementation, 
executes in under five seconds on a 2012 2.0GHz Intel 
Core i7-3667U powered MacBook Air. The optimised La¬ 
grangian, where we have removed terms with coefficients 
less than 10“® and displayed the remaining coefficients 
to two decimal places for clarity, was 

C = -0.30x2 -k 0.14 0.204:2 ^ 

This is exactly the expression, apart perhaps from overall 
scaling, that would be written by a human physicist. The 
coefficients yielded by the search are found to match the 
correct coefficients to the 6th decimal place, limited by 
the convergence tolerance that we set. By generating a 
model in this form the algorithm gives insight into the 
system directly from the data. 

The case of the simple pendulum is also interesting to 
consider. Here the search algorithm tried three models, 
where the third, with m = 2 and p = 4, converged in 
480 Nelder-Mead iterations. The search in this case took 
around 0.6 seconds. The generated model, multiplied by 
100 to make it more readable, was 

£= 0.049x-b8.6x2-3.042-0.0030x42-0.41 x 2 i; 2 . ( 4 ) 

It can be noted that this is not a straightforward Taylor 
expansion of the simple pendulum’s Lagrangian, and it is 
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FIG. 2. Result of running the algorithm on simulated data from the five test systems. In each graph panel, the (blue) open 
circles to the left of the vertical bar are the training data. The solid (red) line, to the right of the vertical bar is the algorithm’s 
prediction. The (red) filled circles to the right of the bar show the actual behaviour of the system. For clarity only every third 
validation point is shown. The algorithm does not have access to these validation points when it is making its prediction. It 
can be seen that the algorithm has accurately learned the dynamics of the system in each case. 
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FIG. 3. Application of the algorithm to data with simulated 
noise added. The graphs are in the same format as hgure[^ 
We see that the algorithm is robust to noise, finding a model 
that accurately predicts the future evolution. 


not obvious how to relate it to the standard form. Exper¬ 
imenting with removing terms and solving the resultant 
equations of motion indicates that the terms proportional 
to X and xx^ are unimportant, but the relatively small 
term in x^x^ is essential. Despite being in an unexpected 
form, this Lagrangian does make successful predictions. 
We shall see in section |V] that it is in fact a local ap¬ 
proximation of a true Lagrangian around the region of 
configuration space that the training trajectory explored. 

Real world measurements are inevitably noisy and so 
to be practically useful it is important that the algorithm 
is able to converge even in the presence of imperfections 
in the data. We took the data for our third test system — 
the coupled harmonic oscillators — and added normally 
distributed noise, with standard deviation 0.1 (about 5% 
of the oscillation amplitude) to the position, velocity, and 
acceleration. Figure shows the result of running the al¬ 
gorithm on this noisy data set. We see that the algorithm 
is robust to this noise, finding a model that describes the 
future evolution well. The algorithm’s convergence was 
slowed slightly taking 2130 iterations, compared to 1740 
for the noise-free case, to converge on the final m = 2, 
p = 2 model. 


V. GENERALISATION 

We have shown that the algorithm can find models 
which successfully predict the future evolution of the sys¬ 
tem’s behaviour. However, a good physical model does 
not just capture the behaviour of a particular time-series, 
corresponding to a particular set of initial conditions. 
Rather, it should be able to predict the behaviour of the 
system over a range of initial conditions. It is perhaps 
this ability to generalise that sets a true physical model 
apart from a mere fit or interpolation of the system’s be¬ 
haviour. It is interesting, therefore, to study whether the 
models found by our algorithm have this property. 


We have seen in the case of the Duffing oscillator that 
the discovered model is indeed the correct model, and we 
would expect that this model will correctly predict the 
dynamics of the system for any initial conditions. We test 
this by simulating the behaviour of the system for a wide 
range of initial conditions, and comparing the results to 
the predictions of the model. We find, as expected, that 
the learned model for the Duffing oscillator does make 
correct predictions for all initial conditions. 

Applying this procedure to the other test systems 
we find that the coupled harmonic oscillators and the 
Penning-type ion trap models also generalise well, mak¬ 
ing successful predictions for all initial conditions. This 
indicates that our algorithm is not merely a sophisticated 
curve fitting routine, but rather is finding the underlying 
physical truth behind the system dynamics to make its 
predictions. 

The pendulum and double pendulum models do not 
generalise well, as we might have anticipated from the 
form of the Lagrangian in Q. Figure compares the 
prediction of the learned simple pendulum model against 
the true behaviour, for a variety of swing amplitudes. We 
see that while the prediction is accurate for the amplitude 
at which the model was trained, it deviates at other am¬ 
plitudes. These results are perhaps to be expected, and 
could well be the same as generated by a human physicist 
given the same data. The algorithm has found a math¬ 
ematically simple approximation that works well for the 
data it has available to it, but does not have enough to 
go on to determine the true underlying model. 

We consider two approaches to generating models that 
generalise better for these systems, inspired by the ap¬ 
proaches a human physicist might take. The first method 
is simply to train the models with more data, correspond¬ 
ing to a wider range of initial conditions. The second is 
to introduce new mathematical constructs which allow a 
simpler model to be found, reasoning that this model is 
more likely to generalise well. 

For the first approach we follow exactly the same pro¬ 
cedure as before except we generate a number of tra¬ 
jectories, corresponding to a range of initial conditions, 
and use a score that is the sum of the scores for the in¬ 
dividual trajectories. We applied this procedure to the 
simple pendulum system. The resulting search takes ap¬ 
proximately 15 times longer to converge than the single¬ 
trajectory search. We find that the algorithm is unable to 
converge on an m = 2, p = 4 model, as it did before, and 
has to continue its search until it finds an to = 4, p = 4 
model whose predictions fit all of the trajectories ade¬ 
quately. Figure [^compares the predictive ability of this 
model with the ‘single-trajectory’ model of the previous 
section. We see that, as shown in figure the single¬ 
trajectory model makes good predictions for the initial 
condition it was trained at, but makes poor predictions 
for other initial conditions. The ‘multi-trajectory’ model, 
though, is much improved. It makes good predictions at 
all of the initial conditions it was trained at, and further 
makes good predictions at other, unseen initial conditions 
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FIG. 4. Predictions for different initial conditions of the learned simple pendulum model (red dashed line) compared to the 
true behaviour (blue solid line). The amplitude of the pendulum swing varies between panels. The model was trained at the 
amplitude shown in the final panel. We see that the model makes good predictions for the initial conditions it was trained on, 
but breaks down for other initial conditions. 


as well. We have found similar results for the double¬ 
pendulum system, although the computational difficulty 
of the problem constrained the experiment to a limited 
region of initial-condition-space. 

Our second approach to generalisation is to expand the 
representation of the Lagrangians to encompass a wider 
range of mathematical expressions. We reason that, with 
a wider palette of mathematics at its disposal, the algo¬ 
rithm may be able to find a model of simpler form that 
works well. History has shown, although this may be tau¬ 
tological, that often systems of interest to physics can 
be described by remarkably simple mathematical mod¬ 
els. We hope that by allowing the algorithm to generate 
structurally simpler models, it may be more likely to dis¬ 
cover the underlying physical truth. 

We have developed a proof-of-principle implementa¬ 
tion of a richer representation, and a corresponding 
search algorithm, detailed in the appendix. Briefly, we 
compose mathematical expressions as trees with leaf- 
nodes corresponding to the system variables, simple func¬ 
tions (sine, cosine, square) of these variables, and numer¬ 
ical constants. Branch-nodes of the tree are arithmetic 
operators -b, —, x. This structure can represent a much 
wider range of mathematical forms than our polynomial 
representation. We search over this tree-structured rep¬ 
resentation using a genetic programming technique that 
simultaneously tries to optimise the score and minimise 
the size of the trees. Thus, this search algorithm tries 
explicitly to find simple expressions that score well on 
the data. 

Repeating the search on large-amplitude (±0.957r) sim¬ 
ple pendulum data using the tree-based representation 
highlights the relative strength of this approach. The 


generated model, which makes a successful prediction, is 
C = 0.25 0^ -b2.0cos(6») , 

the same as would be written by a human physicist. Nat¬ 
urally, this model makes correct predictions over the full 
range of initial conditions. There are two reasons that the 
tree-based expression search is able to converge on this 
model. First it is only because the representation of pos¬ 
sible models is richer that this model can be directly rep¬ 
resented at all. Second, the notion of mathematical com¬ 
plexity in this representation more closely models that of 
a human physicist. This allows the search algorithm to 
do more work driving the result towards an expression 
that we recognize as canonical. It must be cautioned, 
though, that this is only a proof-of-principle demonstra¬ 
tion. To reach this result we had to bias the search algo¬ 
rithm, as described in the appendix, and even then the 
run time is significantly longer, often taking many hours 
with a multi-threaded implementation on the hardware 
described above. We were not able to get results for the 
double pendulum system at all with the computing re¬ 
sources at our disposal. Nonetheless, we present this re¬ 
sult as the technique shows potential for learning models 
that are both better able to generalise, and in a format 
more suitable for communication to human physicists. 

VI. CONCLUSION 

We have demonstrated an algorithm that can predict 
the future dynamics of a physical system directly from 
observed data. We have shown that the algorithm gener¬ 
ates models that can be communicated to a human physi¬ 
cist, sometimes even finding models in textbook form. 
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FIG. 5. Comparing a model for the simple pendulum trained 
on multiple trajectories with one trained on a single trajec¬ 
tory. The curves show the squared error between the model’s 
prediction and the true behaviour, as a function of the pen¬ 
dulum’s swing amplitude. The dashed (blue) curve shows the 
result for a model trained at a single swing amplitude, indi¬ 
cated by the heavy (blue) arrow. This model performs well 
at the amplitude it was trained at, but poorly at other am¬ 
plitudes. The model corresponding to the solid (red) curve 
was trained with multiple trajectories, indicated by the other 
(red) arrows. The original trajectory was also included in the 
training set for this model. We see that the ‘multi-trajectory’ 
model makes better predictions across a wide range of initial 
conditions, including conditions that it was not trained on. 
The multi-trajectory model is able to make successful predic¬ 
tions up to surprisingly large amplitudes, well beyond those 
it has seen in training. 


We have further shown that the models generated gener¬ 
alise well to unseen data, and are not merely fits or in¬ 
terpolations, but are truly capturing the physical essence 
of the system under study. 

One might ask what the use of such an algorithm is. As 
a first point, we find the question of whether a computer 
can do science to be fascinating in itself. Investigating 
the limits of a computer’s ability in this regard educates 
us as to the strengths and weaknesses of our current sci¬ 
entific processes, and invites us to consider a different 
perspective on our scientific work. 

But perhaps a more practical answer is that tools such 
as this could assist humans in their work. We see this 
assistance as coming in two forms. The first is simply 
automating the actions of a scientist so they can be ap¬ 
plied to more data. Techniques that can be automatically 
applied to datasets, scanning for scientifically interesting 
features — in the case of the algorithm in this paper, 
for example, finding that there is a least action princi¬ 
ple at work — may come to be a fruitful approach to 
generating unexpected scientific leads as we head into a 
data-dominated era. The second is opening up the tech¬ 
niques of science to non-specialists. By capturing the 
idea of searching for least action models in an algorithm 
we make it available to anyone, including those without 
the necessary skills to do it by hand. It is interesting in 
this regard to consider popular online natural language 
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FIG. 6. An expression tree representing 9 ^ — 

translation software. While no-one would consider these 
tools suitable for translating poetry, they nonetheless are 
exceedingly useful to many people in the common case 
where a ‘good-enough’ translation will do. While we do 
not imagine computers will replace expert human physi¬ 
cists in the near term, we envisage the availability of 
tools to automate scientific reasoning will empower non¬ 
specialists to take better advantage of the discoveries and 
insights of physics. 
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Appendix: Tree-based representation and search 

Here we describe in detail the tree-structured symbolic 
representation of mathematical expressions and the cor¬ 
responding search routine. The expressions are built fol¬ 
lowing a grammar designed to bias the search towards 
expressions that might be Lagrangians for simple physi¬ 
cal systems. The terminals of these expression trees are 
one of: numerical constants, randomly generated; the 
coordinate variables and velocities; the squares of the co¬ 
ordinates and velocities; and for coordinates which rep¬ 
resent angles, the sine and cosine of the coordinates and 
velocities. The non-terminal nodes of the trees are the 
operators , x. An example expression tree is shown 
in figure]^ representing the expression — 3cos(0). 

To search through these tree-structured expressions we 
take a genetic programming approach m explicitly op¬ 
timising both the least-action score ([3) and also a com¬ 
plexity score, using the SPEA2 multi-objective optimi¬ 
sation algorithm |50]. This biologically-inspired evolu¬ 
tionary algorithm maintains a population of candidate 
expressions and breeds, reproduces, and mutates them 
to try and simultaneously optimise the least-action and 
complexity scores. 

In detail, we first construct a population of randomly 
generated expressions, usually numbering 100. We score 
these expressions using the least-action score and also 
assign a complexity score which is simply the number of 
nodes in the expression tree. The SPEA2 algorithm takes 
the current population, and an initially empty set of elite 
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expressions, representing the best that have been seen so 
far. It has a rather complex selection mechanism [20] 
that produces a new set of elite expressions, plus a set 
of expressions, the breeding pool, which are candidates 
for reproduction. A new generation is created from the 
breeding pool by mutation (10%) and pair-wise crossover 
(90%). Mutation is effected by replacing a random sub¬ 
tree of the given expression with a randomly generated 
subtree. The crossover operation takes two expressions, 
selects a random point in each of the two trees, and swaps 
the sub-trees at these points to generate two new ex¬ 
pressions. The evolutionary process is repeated starting 
from this new generation, and we iterate for a large num¬ 
ber of generations, typically many thousand. To improve 
the convergence speed of the numeric constants in the 
expressions we also incorporate a small amount of hill- 
descent into each evolutionary iteration: a subset (20%) 
of the expressions have their numeric constants randomly 
adjusted by a small amount, and if this improves their 
least-action score, the modification is kept. We also im¬ 
pose a maximum size of expression (50 nodes) and trim 
expressions that exceed it each generation to ensure that 
the run-time is bounded. The final elite set is a set of 


expressions that represent the trade-off between least- 
action score and complexity. We select from this set the 
simplest expression that has a least-action error below a 
specified threshold as the output of the run. 

In this tree-based method, the structure of the candi¬ 
date Lagrangians varies during the search, so it is not 
possible to partially pre-calculate the score function, as 
it was in the restricted polynomial technique. Rather it 
must be calculated in full for each expression in the popu¬ 
lation. Further, the search space of possible expressions is 
exceedingly large, and the score is not very smooth with 
respect to the genetic operations. As a result, the search 
must run for many generations and is extremely compu¬ 
tationally expensive. A naive implementation might cal¬ 
culate the partial derivative time-series in (§ by symbol¬ 
ically differentiating the candidate expression and then 
calculating the value of the derivative. This, however, 
can be exponentially expensive in the depth of the ex¬ 
pression, in terms of both memory and runtime. A better 
approach, that we adopt in this work, is to simultane¬ 
ously evaluate the expression’s value and its derivatives 
using automatic differentiation m- This method avoids 
calculating an expression for the derivative, and has run¬ 
time proportional to the size of the expression. 
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